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Abstract 

RNA-sequencing (RNA-Seq) has become a powerful technology to charac¬ 
terize gene expression profiles because it is more accurate and comprehensive 
than microarrays. Although statistical methods that have been developed for 
microarray data can be applied to RNA-Seq data, they are not ideal due to the 
discrete nature of RNA-Seq data. The Poisson distribution and negative bino¬ 
mial distribution are commonly used to model count data. Recently, Witten 
(2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The 
Poisson assumption may not be as appropriate as negative binomial distribution 
when biological replicates are available and in the presence of over dispersion (i.e., 
when the variance is larger than the mean). However, it is more complicated to 
model negative binomial variables because they involve a dispersion parameter 
that needs to be estimated. 

In this paper, we propose a negative binomial linear discriminant analysis 
for RNA-Seq data. By Bayes’ rule, we construct the classifier by fitting a neg¬ 
ative binomial model, and propose some plug-in rules to estimate the unknown 
parameters in the classifier. The relationship between the negative binomial clas¬ 
sifier and the Poisson classifier is explored, with a numerical investigation of the 
impact of dispersion on the discriminant score. Simulation results show the su¬ 
periority of our proposed method. We also analyze four real RNA-Seq data sets 
to demonstrate the advantage of our method in real-world applications. 
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1 Introduction 


RNA-sequencing (RNA-Seq) is a revolutionary technology that uses the capabilities 
of next-generation sequencing to quantify gene expression levels (Mardis 2008, Wang 
et al. 2009, Morozova et al. 2009). Compared to microarray technology, RNA-Seq 
has many advantages including the detection of novel transcripts, low background 
signal, and the increased specificity and sensitivity. Due to reduced sequencing cost, 
RNA-Seq has been widely used in biomedical research in recent years Lorenz et al. 
(2014). In real-world applications, the gene expression profile of biopsy or serum sample 
from an individual can be used to test whether this individual has a disease and/or 
a specific type of disease, which is essentially a classification problem. Different from 
the microarray technology that measures the level of gene expression on a continuous 
scale, RNA-Seq counts the number of reads that are mapped to one gene and measures 
the level of gene expression with nonnegative integers. As a result, popular tools that 
assume a Gaussian distribution in microrray data analysis, such as linear discriminant 
analysis, may not perform as well as those methods that adopt appropriate discrete 
distributions for RNA-Seq data. 

For RNA-Seq data, the Poisson distribution and negative binomial distribution are 
two common distributions considered in the expression detection and classification. 
Many methods have been proposed to detect differentially expressed genes, including 
edgeR (Robinson and Smyth 2008),Robinson et al. (2010), DESeq Anders and Huber 
(2010), baySeq Hardcastlc and Kelly (2010), BBSeq Zhou et al. (2011), SAMseq Li 
and Tibshirani (2013), DSS Wu et al. (2013), AMAP Si and Liu (2013), sSeq Yu et al. 
(2013), and LFCseq Lin et al. (2014). However, there is less progress on the classi¬ 
fication using RNA-Seq data until recently. Witten (2011) proposed a Poisson linear 
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discriminant analysis (PLDA) which assumes that RNA-Seq data follow the Poisson 
distribution. Tan et al. (2014) further discussed many methods, such as logistic regres¬ 
sion and partial least squares, and showed that PLDA is a comparable method. The 
Poisson distribution is suitable for modeling RNA-Seq data when biological replicates 
are not available. However, if biological replicates are available, the Poisson distribu¬ 
tion may not be a proper choice owing to the overdispersion issue, where the variances 
of such data arc likely to exceed their means (Anders and Huber 2010, Si and Liu 2013). 
The overdispersion issue can have a significant effect on classification accuracies. In 
real-world applications, biological replicates can provide more convincing results than 
technical replicates. Therefore, it is necessary to look for some solutions to take the 
overdispersion issue into consideration. 

We note that Witten (2011) has considered this problem and pointed out that the 
classification accuracy can be further improved for overdispersed data by extending the 
Poisson model to the negative binomial model. However, to construct an appropriate 
negative binomial classifier for practical use, two major issues remain to be solved. 
The first issue is that the probability density function (pdf) of the negative binomial 
distribution is more complicated than that of the Poisson distribution, which gives 
rise to a more complicated classifier. The second issue is that the negative binomial 
distribution contains a dispersion parameter, which controls how much its variance 
exceeds its mean. To construct the classifier using the negative binomial model, we 
need to estimate the dispersion parameter. To avoid fitting the complicated negative 
binomial model, Witten (2011) proposed a transformation method for the overdispersed 
data and found that this method works well if the over dispersion is mild. 

In light of the importance of the dispersion in modelling RNA-Seq data with the 
negative binomial distribution, some dispersion estimation methods have been proposed 
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recently in the literature. For example, Wu et al. (2013) proposed a dispersion estimator 
using empirical Bayes method and applied it to find differentially expressed genes. Yu 
et al. (2013) proposed a shrinkage estimator of dispersion which shrinks the estimates 
obtained by the method of moments towards a target value, and also applied it to 
detect differentially expressed genes. These new methods on estimating the dispersion 
parameter make it possible to construct a negative binomial classifier to achieve better 
classification accuracy on RNA-Seq data. 

In this paper, we propose a negative binomial linear discriminant analysis (NBLDA) 
for RNA-Seq data. The main contributions of this paper are in, but not limited to, 
the following two aspects: 

1. We extend Witten (2011) to build a new classifier based on the negative binomial 
model. Under the assumption of independent genes, we define the discriminant 
score by Bayes’ rule and propose some plug-in rules for estimating the unknown 
parameters in the classifier. 

2. We further explore the relationship between NBLDA and PLDA. A numerical 
comparison is conducted to explore how the dispersion changes the discriminant 
score. The comparison results will provide some guidelines for scientists to decide 
which method should be used in the discriminant analysis of RNA-Seq data. 

To demonstrate the performance of our proposed method, we conduct several simula¬ 
tion studies under different numbers of genes, sample sizes, and proportions of differen¬ 
tially expressed genes. Simulation results show that the proposed NBLDA outperforms 
existing methods in many settings. Four real RNA-Seq data sets are also analyzed to 
demonstrate the advantage of NBLDA. Specifically, we propose the negative binomial 
classifier study, the relationship between NBLDA and PLDA, and present the param- 
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eter estimation in Section 2. Simulation studies and real data analysis are conducted 
in Sections 3 and 4. We conclude the paper with some discussions in Section 5. 

2 Negative Binomial Linear Discriminant Analysis 

Let X ig denote the numbers of reads mapped to gene g in sample i, i = 1 
and g = 1 ,,G. Our goal is to identify which class a new observation belongs 
to. Witten (2011) proposed a PLDA for classifying RNA-Seq data. When biological 
replicates are available, however, overdispersion occurs for RNA-Seq data and hence 
the Poisson distribution may no longer be appropriate. In this section, we propose 
a new discriminant analysis for RNA-Seq data by assuming that the data follow the 
negative binomial distribution. 

2.1 Methodology 

Consider the following negative binomial distribution for RNA-Seq data: 

X-ig ~ j k'ig (1) 

where ,s t is the size factor which is used to scale gene counts for the ith sample due to 
different sequencing depth, \ g is the total number of reads per gene, and > 0 is the 
dispersion parameter. We have E(X ig ) = fx ig and Var(X, fl ) = fi ig + g% g (j) g . Note that 
the variance is larger than the mean for the negative binomial distribution. 

Let K be the total number of classes and Ck G {1,..., n} the indices of samples in 
class k for k — 1,..., K. Then the class-specific model for RNA-Seq data is given by 

(Xig\yi = k) ~ NB(/4 9 4 9 ,0 s ), (2) 

where 4g represents the differences among K classes, and y* = k G {1,..., A'} rep¬ 
resents the label of sample i. We also follow the independence assumption in Witten 
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(2011) that all genes are independent of each other. Note that the independence as¬ 
sumption is frequently assumed in microarray data analysis. 

Let x* = (X*,..., Xq) t be a test sample with s* the size factor and y* the class 
label. By Bayes’ rule, we have 

P(y* = k\x*) oc / fc (x*)7T fc , (3) 


where f k is the pdf of the sample in class k, and ir k is the prior probability that one 
sample comes from class k. The pdf of X ig = x ig in model ((2]) is 


P(X-ig %ig\yi k'j 


T cj)g ) / Si\gdkg(j)g 
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By (ED and (ED, we have the following discriminant score for NBLDA: 
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log P(y* = k |x*) = [log d k g - l0g(l + S*\gd k g(j)g)\ 
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+ log 7Tfc + C, 


(4) 


(5) 


where C is a constant independent of k. We then assign the new observation x* to 
class k that maximizes the quantity (ED- Throughout the paper, we estimate the prior 
probability 7by n k /n, where n k is the sample size in class k. For balanced data, the 
prior probability is simplified as n k = 1 /K for all k = 1,,K. For gene g, the total 
number of reads is X g = Y2i=iXi g , an< 4 tl ie class difference d kg can be estimated by 
X ig + 1 )/(Y^iec k s ^g + 1). Estimation of the unknown parameters including s* 
and (pg will be discussed in Section 2.2. 

To explore the relationship between the proposed NBLDA and the PLDA in Wit¬ 
ten (2011), we assume that s*X g d kg are bounded. When (j) g — > 0, we have log(l + 
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s*\ g d kg (t)g ) ->• 0 and 0 ff 1 log(l + s*\ g d kg (t)g ) = log(l + s*A g 4 9 0 9 )^ 1 ->■ s*X g d kg . Then 
consequently, 


log P(y* = k |x*) 


G G 

J2 X* log d k g - S*Xgd k g 
9 =1 9=1 

T log 7T+ C, 


( 6 ) 


where the right hand of (JB|) is the discriminant score of PLDA. That is, the NBLDA 
classifier reduces to the PLDA classifier when there is little dispersion in the data. 
From this point of view, the proposed NBLDA can be treated as a generalized version 
of PLDA. 

Since NBLDA contains the dispersion parameter which PLDA does not have, in 
what follows, we investigate how the dispersion changes their discriminant scores. We 
conduct a numerical comparison between NBLDA and PLDA. Two cases are consid¬ 
ered in this paper. The first case is that all genes have a common dispersion, and 
the second is that genes have different dispersions. Note that the classifiers (J5]) and 
((6]) have the same terms: log7rfc and C. Without loss of generality, we compute the 
discriminant scores only using the first two terms in (J5]) and (l6lh respectively. In the 
comparison study, we fix X* = 10, d kg = 1.5, s* — 1, X g — 10 and G = 500. For the 
case of common dispersion, we set the dispersion ranging from 0 to 20. For the case 
of different dispersions, we let (f) g be independent and identically distributed (i.i.d.) 
random variables from a chi-squared distribution with the degrees of freedom ranging 
from 0.1 to 5. 

Figure 1 exhibits the comparison results. The left panel shows the results for the 
case of common dispersion. Note that the discriminant score of PLDA is independent 
of the dispersion parameter and hence is a constant. For NBLDA, its discriminant score 
is a curve, and the slope is large for low dispersions and small for high dispersions. We 
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Figure 1: Numerical comparisons between NBLDA and PLDA. The left panel shows 
the results with a common dispersion 0. The right panel shows the results with dif¬ 
ferent gene-specific dispersions (j) g which are i.i.d. random variables from a chi-squared 
distribution with r degrees of freedom. We compute the discriminant scores of NBLDA 
and PLDA for different 0 and r. 


discover that the discriminant score of NBLDA is sensitive to the dispersion. Even 
when the dispersion is very small, the difference between the two discriminant scores 
is significant. The right panel in Figure 1 shows the results for the case of different 
dispersions. The pattern of the right panel is similar to the left one except that the 
curve of NBLDA is not smooth. This suggests that when we analyze real data, we 
should first compute its average dispersion and then use such information to determine 
which classifier to use. 

2.2 Parameter Estimation 

Note that the discriminant score in ([5]) involves two unknown parameters, size factor 
s* and dispersion parameter 0 5 . 

















2.2.1 Size factor estimation 


Due to different sequencing depths, samples have different total numbers of reads. 
Hence a normalization of the read counts through a size factor is a necessary step for 
analyzing RNA-Seq data (Bullard et ah 2010, Dillies et ah 2013). To estimate the size 
factor Si for the training data and the size factor s* for the test data, we consider the 
following three procedures: 


• Total count Witten (2011) divided the total read counts of sample i by the total 
read counts of all samples to estimate the size factor of sample i. That is, 

E g y * 

_ g=l "W 

^ y ’ 

2^ii=i Z^ 9 =i ^ig 
Y 
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• DESeq: Anders and Huber (2010) first divided the read counts of sample i by the 
geometric mean of all samples’ read counts, and then estimated the size factor 
by computing the median of those G values. Specifically, the size factors are 
estimated by 


= median. 


X* 


(n:u^) iA 


= median. 


X 


*9 


(nr.i v,) 1 /” 


• Upper quartile : Bullard et al. (2010) proposed a robust method that uses the 
upper quartile of the read counts to estimate the size factors. Specifically, the 
size factors are estimated by 


s = 


Si 


Q 


E n ? 

*=i T 
Qi 

E n ? 

i=i Qi 
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where q* and g* are the upper quartiles for the test data and sample i in the 
training data, respectively. 

In our simulation studies, we find that the three methods provide little difference in 
the performance of classification. Hence, for brevity, we only report in the remainder 
of the paper the simulation results based on the total count method. 

2.2.2 Dispersion parameter estimation 

Various methods for estimating the dispersion parameter cf) g have been proposed in the 
literature (Robinson and Smyth 2008, Robinson et al. 2010, Anders and Huber 2010, 
Hardcastle and Kelly 2010). A comparative study is also available in Landau and Liu 
(2013) where the authors investigated the influence of different dispersion parameter 
estimates on detecting differentially expressed genes in RNA-Seq data. More recently, 
Yu et al. (2013) proposed a shrinkage estimator for <j) g that shrinks the gene-specific 
estimation towards a target value. Specifically, the dispersion estimator is estimated 
by 

4>g = 8£+0-- ( 7 ) 

where 5 is a weight defined as 

s S°,i{^-( 1 /G)E°i4} 2 /(G- 1 ) 

Ehi(^»-«) 2 /(G-2) 

< i> g are the initial dispersion estimates obtained by the method of moments, and £ is 
the target value calculated by minimizing the average squared difference between cj) g 
and (j) g . In this paper, we use the estimator (J7|) to estimate the dispersion parameter. 
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3 Simulation Studies 


In this section, we evaluate and compare the following classification methods: 

• NBLDA, 

• PLDA, 

• Support vector machines (SVM), 

• K-nearest neighbors (KNN). 

For PLDA, we use the R package “PoiClaClu” provided in Witten (2011). For SVM, 
we use the R package “el071” and choose the radial basis kernel in our simulation 
studies. For KNN, we choose k — 1, 3 and 5. 

3.1 Simulation Design 

We generate the data from the following negative binomial distribution: 

(x ig \Ui = k) ~ NB( Si A 9 4 s ,0). (8) 

The total number of classes is K = 2, and both the training data and test data have 
n samples. In all G genes, the proportions of differentially expressed genes are 0.2, 
0.4, 0.6, 0.8 and 1.0, which represents that 20%, 40%, 60%, 80% and 100% genes 
are differentially expressed, respectively. For the differentially expressed genes, we 
set log dkg = Zk g , where Zk g are i.i.d. random variables from the normal distribution 
V(0,<r 2 ). For the constant genes, we set dk g = 1. The size factors s* are i.i.d. random 
variables from the uniform distribution on [0.2, 2.2], The X g values are i.i.d. random 
variables from the exponential distribution with rate 0.04. Note that, for the sake of 


11 



n=8, G=20 


n=24, G=20 



n=8, G=50 n=24, G=50 




Figure 2: Mean misclassification rates for all four methods with <p = 20 and a — 5. 
The x-axis represents the proportion of differentially expressed genes. 20%, 40%, 60%, 
80% and 100% differentially expressed genes are considered, respectively. These plots 
investigate the effect of proportion of differentially expressed genes. 


fairness, we have essentially followed the same simulation settings as those in Witten 
(2011). For the values of G , n, (j) and a, we specify them in Figures 2, 3 and 4. 

To compare these methods, we compute the mean misclassification rates as follows: 
for each simulation, we generate n test samples and compute the following misclassifi¬ 
cation rate: 

the number of misclassihed samples 
n 

We run 1,000 simulations, compute its mean, and then obtain the mean misclassifica¬ 
tion rate. 
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80% DE, n=8 


80% DE, n=24 




Figure 3: Mean misclassification rates for all four methods with <j) = 20 and a — 5. 
“80% DE” means 80% genes are differentially expressed, and the same to “40% DE”. 
This plot investigates the effect of numbers of genes. 
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Figure 4: Mean misclassification rates for all four methods with a = 5. “80% DE” 
means 80% genes are differentially expressed, and the same to “40% DE”. This plot 
investigates the effect of overdispersion. 
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3.2 Simulation Results 


Figure 2 investigates the effect of the proportion of differentially expressed genes on 
the mean misclassification rate. In general, with an increasing number of differentially 
expressed genes, both methods have decreased mean classification rates. NBLDA al¬ 
ways outperforms the other three methods. In particular, when the sample size is small 
(n = 8), NBLDA has a significant improvement over the other approaches. 

Figure 3 investigates the impact of the number of genes on the mean misclassifi¬ 
cation rate. We consider G = 20, 30, 50, and 100 for this investigation. From Figure 
3, we observe that an increasing number of genes will lead to a lower misclassification 
rate. NBLDA shows its superiority over the other three methods, and the improvement 
is more significant when the sample size and the number of genes are smaller. 

Figure 4 investigates the effect of overdispersion on the mean misclassification rate. 
We consider 0=1,5, 10, 20 and 30 for this investigation. Figure 4 shows that a larger 
dispersion will result in a higher mean misclassification rate. Both NBLDA and PLDA 
perform better than SVM and KNN. When the over dispersion is not very high, NBLDA 
and PLDA have similar performance, with NBLDA slightly better than PLDA. When 
the over dispersion is high, however, the performance of NBLDA is much better than 
PLDA. 

For real biomedical research in which RNA-Seq technology is used, it is common 
that thousands or tens of thousands of genes are measured simultaneously. We perform 
a gene selection procedure to screen the informative genes before applying a classifica¬ 
tion rule to RNA-Seq data. By doing gene selection, we rule out the noise as much as 
possible so that the variance of the discriminant score is reduced, and consequently we 
have an increased interpretability. For more details, see Section 4. 
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4 Real Data Analysis 

We first describe four data sets. The first three are RNA-Seq data and the last one is 
a chromatin immunoprecipitation (ChIP) sequencing data set. 

• Liver and kidney data Marioni et al. (2008). There are two classes in this data 
set. One class contains 7 technical replicates which come from a liver sample. 
The other class contains 7 technical replicates which come from a kidney sample. 
A total of 22,925 genes are measured in this data set. The data set is available 
as a Supplementary File in Marioni et al. (2008). 

• Yeast data Nagalakshmi et al. (2008). The data set contains two library prepa¬ 
rations: random haxamer (RH) and oligo (dT), which are treated as two classes 
in this paper. In each class, three samples are included: one original sample, its 
technical replicate, and its biological replicate. A total of 6,874 genes are quanti¬ 
fied in this data set. The data set is available as a Supplementary File in Anders 
and Huber (2010). 

• Cervical cancer data Witten et al. (2010). Two groups of samples are contained in 
this data set. One is the nontumor group which includes 29 samples, and the other 
one is the tumor group which includes 29 samples. There are 714 microRNAs 
in this data set. This data set is available in Gene Expression Omnibus (GEO) 
Datasets with access number GSE20592. 

• Transcription factor binding data Kasowski et al. (2010). This data set contains 
10 classes with a total of 39 samples. 19,061 binding regions are included in 
this data set and those regions are treated as distinct features. This data set is 
available as a Supplementary File in Anders and Huber (2010). 
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4.1 Gene Selection 


The BSS/WSS method, which is proposed by Dudoit et al. (2002), is a common gene 
selection method and has been widely used in the literature (Lee et al. 2005, Pang 
et al. 2009, Huang et al. 2010). This method computes the ratio of the sum of squares 
between groups to the sum of squares within groups for each gene, and selects genes 
whose ratios are in the top. However, this method assumes the data to be normally 
distributed so that it may not be suitable for RNA-Seq data. 

Witten (2011) proposed a screening method to select genes for RNA-Seq data. Since 
gene g will be deleted from the classification rule, dk g = 1, they shrink the estimate 
of dkg towards 1 by using soft-thresholding to perform the gene selection procedure. 
However, this method can not be applied to our discriminant analysis because the 
dispersion is involved in our discriminant rule. For the negative binomial distribution, 
edgeR (Robinson and Smyth 2008, Robinson et al. 2010) has been proposed to detect 
differentially expressed genes in RNA-Seq data. This method first estimates the gene- 
wise dispersions by maximizing the combination of gene-specific conditional likelihood 
and common conditional likelihood, and then replaces the hypergeometric distribution 
in Fisher’s exact test by the negative binomial distribution to construct an exact test. 
In this paper, we use edgeR to perform the gene selection procedure, which is available 
in Bioconductor (www.bioconcluctor.org). 

4.2 Results 

We first conduct the gene selection procedure using edgeR and obtain G genes for 
further analysis. We then randomly split the sample into two sets: the training set and 
the test set. The training set is used to construct the classifier and the test set is used 
to compute the misclassification rate. We repeat the whole procedure 1,000 times and 
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Figure 5: Mean misclassification rates for Cervical cancer data and Transcription factor 
binding data. 


compute the mean misclassification rate for the four methods, NBLDA, PLDA, SVM, 
and KNN, respectively. 

The comparison results are shown in Figure 5. Because the mean misclassification 
rates of the four methods are all zeros for Liver and kidney data and Yeast data, we 
only show the results for other two data sets in Figure 5. For Cervical cancer data, 
52 samples are assigned to the training set and 6 samples to the test set. A total of 
20, 30, 50, 100, 200 and 500 genes are selected, respectively. Among all approaches we 
consider in this paper, our proposed NBLDA has the lowest misclassification rate. A 
big improvement over the other approaches can be observed when more than 50 genes 
are selected. For Transcription factor binding data, to conduct the binary classification, 
we randomly assign 30 samples to the training set and the remaining 9 samples to the 
test set. We choose 20, 50, 100, 500 and 1,000 genes, respectively for this data set. 
In Figure 5, we observe that NBLDA also outperforms PLDA for Transcription factor 
binding data. Expect when the number of genes is small, NBLDA has a better or 
comparable performance than the other three methods. 

Finally, we estimate the average dispersion of the two data sets to check if it also 
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Table 1: The average dispersions for Cervical cancer data and Transcription factor 
binding data, where ”(7” represents the number of top genes selected by edgeR. 


Data sets 

(7=20 

(7=50 

Q 

o 

o 

(7=500 

Cervical cancer 

25.71 

24.42 

19.02 

11.03 

Transcription factor binding 

8.12 

5.71 

4.48 

2.86 


supports our comparison results made in the previous paragraph. The simplest way for 
estimating the dispersion is to use the method of moments. However, this estimate may 
not be reliable (sometimes is a negative value) when the sample size is small. Landau 
and Liu (2013) and Yu et al. (2013) recently reviewed several dispersion estimation 
methods. For Cervical cancer data and Transcription factor binding data, we compute 
their average dispersions using the method in Yu et al. (2013) and present the estimates 
in Table 1. We note that both data sets possess a considerably high average dispersion 
when the number of selected genes is not very large. This, together with the numerical 
comparison in Figure 1, explains why NBLDA provides a better performance than 
PLDA for these two data sets. 

5 Discussion 

Next generation sequencing technology has been widely applied in biomedical research 
and RNA-Seq begins to replace the microarray technology gradually in recent years. 
Since RNA-Seq data are nonnegative integers, differing from that of microarray data, 
it is necessary to develop methods that are well suited for RNA-Seq data. Two dis¬ 
crete distributions, the Poisson distribution and negative binomial distribution, are 
commonly used in the literature to model RNA-Seq data. Compared to the Poisson 
distribution, the negative binomial distribution allows its variance to exceed its mean 
and is more suitable for the situations when biological replicates are available. Never¬ 
theless, the negative binomial model is more complicated than the Poisson model as 
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the additional dispersion parameter also needs to be estimated. 

In this paper, we have proposed an NBLDA classifier using the negative binomial 
model. Our simulation results show that onr proposed NBLDA has a better perfor¬ 
mance than PLDA in the presence of moderate or high dispersions. When there is 
little dispersion in the data, NBLDA is also comparable to PLDA. We have further 
explored the relationship between NBLDA and PLDA, and investigated the impact of 
dispersion on the discriminant score of NBLDA by conducting a numerical comparison. 
It is worth noting that even for a small dispersion, the two discriminant scores can be 
rather different. This suggests that for real RNA-Seq data with moderate or high dis¬ 
persion, NBLDA may be a more appropriate method than PLDA. Note that the true 
dispersions are unlikely to be known in practice. Therefore, we propose to first estimate 
the average dispersion using some novel estimation methods in the recent literature. 
Second, if the estimated average dispersion is small, we use PLDA; and otherwise we 
use NBLDA. 

We note that the independence assumption in Witten (2011) and in this paper is 
very restrictive. For real gene expression data sets, it may not be realistic to assume 
that all genes are independent of each other. In our future study, we would like to 
incorporate the network information of pathways or gene sets to further improve the 
performance of classification. The clustering of sequencing data is also an important 
issue in biomedical research. Hence, another possible future work is to extend the 
clustering method in Witten (2011) to follow the negative binomial model. To conclude, 
our proposed method is general and can be applied to other next generation sequencing 
data sets including ChIP-Seq data. 


20 



Acknowledgements 


Hongyu Zhao’s research was supported by the National Institutes of Health grants RC2 
DA028909, R01 DA12690, R01 DA12849, R01 DA18432, R01 AA11330, R01AA017535, 
P50 AA12870, R01 GM59507, R01 DA030976, and UL1 RR024139. Xiang Wan’s 
research was supported by the Hong Kong RGC grant HKBU12202114, the Hong 
Kong Baptist University grant FRG2/13-14/005, and Hong Kong Baptist University 
Strategic Development Fund. Tiejun Tong’s research was supported by the Hong Kong 
RGC grant HKBU202711 and the Hong Kong Baptist University grants FRG2/11- 
12/110, FRGl/13-14/018, and FRG2/13-14/062. 

References 

Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count 
data, Genome Biology 11: R106. 

Bullard, J. H., Purdom, E., Hansen, K. D. and Dudoit, S. (2010). Evaluation of 
statistical methods for normalization and differential expression in mRNA-Seq 
experiments, BMC Bioinformatics 11: 94. 

Dillies, M. A., Ran, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, 
N., Keimc, C., Marot, G., Castel, D., Estelle, J. et al. (2013). A comprehensive 
evaluation of normalization methods for Illumina high-throughput RNA sequenc¬ 
ing data analysis, Briefings in Bioinformatics 14: 671-683. 

Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination meth¬ 
ods for the classification of tumors using gene expression data, Journal of the 
American Statistical Association 97: 77-87. 


21 



Hardcastle, T. J. and Kelly, K. A. (2010). baySeq: Empirical Bayesian methods for 
identifying differential expression in sequence count data, BMC Bioinformatics 
11: 422. 

Huang, S., Tong, T. and Zhao, H. (2010). Bias-corrected diagonal discriminant rules 
for high-dimensional classification, Biometrics 66: 1096-1106. 

Kasowski, M., Grubert, F., Heffelhnger, C., Hariharan, M., Asabere, A., Waszak, 
S. M., Habegger, L., Rozowsky, J., Shi, M., Urban, A. E. et ah (2010). Variation 
in transcription factor binding among humans, Science 328: 232-235. 

Landau, W. M. and Liu, P. (2013). Dispersion estimation and its effect on test per¬ 
formance in RNA-Seq data analysis: A simulation-based comparison of methods, 
PLOS ONE 8: e81415. 

Lee, J. W., Lee, J. B., Park, M. and Song, S. H. (2005). An extensive comparison 
of recent classification tools applied to microarray data, Computational Statistics 
and Data Analysis 48: 869-885. 

Li, J. and Tibshirani, R. (2013). Finding consistent patterns: A nonparametric ap¬ 
proach for identifying differential expression in RNA-Seq data, Statistical Methods 
in Medical Research 22: 519-536. 

Lin, B., Zhang, L. and Chen, X. (2014). LFCseq: a nonparametric approach for differ¬ 
ential expression analysis of RNA-seq data, BMC Genomics 15(Suppl 10): S7. 

Lorenz, D. J., Gill, R. S., Mitra, R. and Datta, S. (2014). Using RNA-seq data to detect 
differentially expressed genes, Statistical Analysis of Next Generation Sequencing 
Data, Springer, pp. 25-49. 


22 



Mardis, E. R. (2008). Next-generation DNA sequencing methods, Annual Review of 
Genomics and Human Genetics 9: 387-402. 

Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. and Gilad, Y. (2008). RNA- 
seq: an assessment of technical reproducibility and comparison with gene expres¬ 
sion arrays, Genome Research 18: 1509-1517. 

Morozova, O., Hirst, M. and Marra, M. A. (2009). Applications of new sequencing 
technologies for transcriptome analysis, Annual Review of Genomics and Human 
Genetics 10: 135-151. 

Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M. and Snyder, 
M. (2008). The transcriptional landscape of the yeast genome defined by RNA 
sequencing, Science 320: 1344-1349. 

Pang, H., Tong, T. and Zhao, H. (2009). Shrinkage-based diagonal discriminant analysis 
and its applications in high-dimensional data, Biometrics 65: 1021-1029. 

Robinson, M. D., McCarthy, D. J. and Smyth, G. K. (2010). edgeR: a Bioconductor 
package for differential expression analysis of digital gene expression data, Bioin¬ 
formatics 26: 139-140. 

Robinson, M. D. and Smyth, G. K. (2008). Small-sample estimation of negative bino¬ 
mial dispersion with applications to SAGE data, Biostatistics 9: 321-332. 

Si, Y. and Liu, P. (2013). An optimal test with maximum average power while con¬ 
trolling FDR with application to RNA-Seq data, Biometrics 69: 594-605. 

Tan, K. M., Petersen, A. and Witten, D. (2014). Classification of RNA-seq data, 
Statistical Analysis of Next Generation Sequencing Data, Springer, pp. 219-246. 


23 



Wang, Z., Gerstein, M. and Snyder, M. (2009). RNA-Seq: a revolutionary tool for 
transcriptomics, Nature Reviews Genetics 10: 57-63. 

Witten, D. M. (2011). Classification and clustering of sequencing data using a Poisson 
model, The Annals of Applied Statistics 5: 2493-2518. 

Witten, D., Tibshirani, R., Gu, S. G., Fire, A. and Lui, W. (2010). Ultra-high through¬ 
put sequencing-based small rna discovery and discrete statistical biomarker anal¬ 
ysis in a collection of cervical tumours and matched controls, BMC Biology 8: 58. 

Wu, H., Wang, C. and Wu, Z. (2013). A new shrinkage estimator for dispersion 
improves differential expression detection in RNA-Seq data, Biostatistics 14: 232- 
243. 

Yu, D., Huber, W. and Vitek, O. (2013). Shrinkage estimation of dispersion in Negative 
Binomial models for RNA-Seq experiments with small sample size, Bioinformatics 
29: 1275-1282. 

Zhou, Y., Xia, K. and Wright, F. A. (2011). A powerful and flexible approach to the 
analysis of RNA sequence count data, Bioinformatics 27: 2672-2678. 


24 



